This report investigates the frequencies of motor vehicle collisions occurring in New York City (NYC) according to the time of day. Our results show that the frequency of accidents occurring per day have remained fairly constant. However, an increase in motor vehicle collisions during the time period of 8am to 9am was observed, and the highest number of road fatalities occurred during 4pm-6pm. These findings reflect the NYC ‘rush hour’ when people are commuting home from work or going out in the evening. In particular, Friday 4pm-6pm had the highest number of fatalities during the week, whereas Saturday and Sunday had significantly fewer accidents compared to weekdays. The number of motor vehicle collisions occurring on Monday, Tuesday, Wednesday, and Thursday were similar. These findings inform the NYC Fire Department (FDNY) and (FDNY) Bureau of Emergency Medical Services (EMS) of the times they must be most alert. Words: 146 words
Motor vehicle collisions are a major cause of death and injury in New York (City of New York, 2021). This report aims to inform stakeholders of the most common time of day for motor vehicle collisions to occur in NYC from 2012-2021. Relevant stakeholders include the NYC Fire Department (FDNY) and (FDNY) Bureau of Emergency Medical Services (EMS).
The Motor Vehicle Collisions crash data was sourced from NYC OpenData and was provided by the NYPD for public safety purposes (NYC OpenData, 2021). The dataset is classified as free public data and the NYC OpenData website includes thorough information on attribution, creation date, and the data generation process (NYC OpenData, 2021). Data was collected by police officers, who completed a MV-104AN report for all vehicle collisions in NYC (NYC OpenData, 2021). Only very basic data was collected from 1999-2016, but more detailed information was collected from 2016 (NYC OpenData, 2021). An additional limitation is the absence of data prior to 2012, which prevents our ability to analyse trends over a longer period. The data is reliable as it was inputted by trained police officers (NYC OpenData, 2021). However, potential human error must still be considered, and variations in data collection may exist between individual police officers. The dataset is updated daily and a ‘MVCDataDictionary’ spreadsheet records revision history (NYC OpenData, 2021).
Previous statistics indicate that car collisions are more common in NYC on weekdays during lunch time and the evening peak hour when individuals are commuting (Sullivan & Galleshaw LLP, 2021). During 9pm-3am, collisions occur more frequently on weekends (Sullivan & Galleshaw LLP, 2021). For ethics and privacy purposes, the Motor Vehicle Collisions dataset does not reveal confidential information about individuals.
The dataset consists of 1.7 million rows and 29 columns. Each row corresponds to a motor vehicle collision, whereas the columns provide details of the collision.
library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly)
library(dplyr)
nyc = read.csv("MVC.csv")
#nyc %>% glimpse()
#nyc %>% summary()
#vis_miss(nyc, warn_large_data = FALSE)
nyc %>%
select_if(function(x) any(is.na(x))) %>%
summarise_each(funs(sum(is.na(.)))) -> NAtable
kable(NAtable)
| ZIP.CODE | LATITUDE | LONGITUDE | NUMBER.OF.PERSONS.INJURED | NUMBER.OF.PERSONS.KILLED | VEHICLE.TYPE.CODE.2 |
|---|---|---|---|---|---|
| 490687 | 197179 | 197179 | 17 | 31 | 2 |
cleannyc <- nyc[!(nyc$LONGITUDE == "" | nyc$LATITUDE == "" | nyc$LOCATION == "" | nyc$LATITUDE == 0 | nyc$LONGITUDE == 0),]
#cleannyc %>% glimpse()
#table(is.na(cleannyc))
cleannyc = na.omit(cleannyc)
#vis_miss(cleannyc, warn_large_data = FALSE)
boxplot(cleannyc[,11:18],cex.axis = 0.6, las = 1, horizontal = TRUE,par(mar= c(5, 10, 4, 2) + 0.1))
As we can see in the boxplot above, there are many outliers especially in the number of motorist injured, and number of persons injured. Upon inspection when there was the number of motorist injured, it occurred at 9/9/2013 and is a Brooklyn Bus Accident which left 43 people injured when a car collided head on with a bus. And it also turns out that this is the same entry for the outlier in number of persons injured. The reasons why it is in both persons and motorist category is because the 43 people are in the bus, therefore classified as motorists. These outliers without inspection may seem extraordinary and perhaps a possibility of being faulty data collection, however with a further glance they seem to be valid and an important part of our data analysis. In fact in comparison to the mean and median of all these columns, most of the circles shown in the graph are considered outliers. Evidently the median and mean are around 0 accidents, which is expected. Because we have so many entries in data, and the probability of being in an accident is relatively small, this graph is exposed to skewedness, and thus we cannnot say that all these data points greater than 0 are outliers.
#max(cleannyc$NUMBER.OF.MOTORIST.INJURED)
#cleannyc %>% filter(NUMBER.OF.MOTORIST.INJURED == 43)
#max(cleannyc$NUMBER.OF.PERSONS.INJURED)
#cleannyc %>% filter(NUMBER.OF.PERSONS.INJURED == 43)
Even though these outliers are valid, they will affect our aggregate data, by dragging the mean higher than it should be. This is why median is much better than using mean, as it is not as affected by high outliers. We do not care about low outliers as the base is 0 and cannot fall lower. We can also take a look into more details and the affect of these two outliers using the graph below.
fig = plot_ly(y = cleannyc$NUMBER.OF.PERSONS.INJURED, type = "box", name = "Number of Persons Injured")
fig = fig %>% add_trace(y = cleannyc$NUMBER.OF.MOTORIST.INJURED, name = "Number of Motorists Injured") %>% layout(title = "Persons Injured and Motorist Injured Outlier Analysis")
fig
Here we explore the frequencies of road accidents happening according to the time of the day. To provide emergency assistance to victims, the NYC Fire Department (FDNY) and (FDNY) Bureau of Emergency Medical Services (EMS) has 340 fire engines and 450 ambulances respectively1.
In the graph below we see the trend of no. of road accidents happening over the course of 2012 to 2019.
as.Date(nyc$CRASH.DATE, tryFormats = c("%m/%d/%Y")) -> nyc$CRASH.DATE
as.data.frame(table(nyc$CRASH.DATE)) -> date_table
plot_ly(date_table, x= ~Var1, y = ~Freq, type = "scatter", mode = "lines", text = paste(date_table$Var1, ", ", date_table$Freq)) -> fig1
fig1 %>% layout(title = "No. of accidents per day in NYC (2012-2019)",
yaxis = list(title = "No. of Accidents (in 1 day)"),
xaxis = list(title = "Date",
type = "date",
range = c("2012-07-01", "2019-12-03"))) -> fig1
fig1
From the above graph we can see that the frequency of road accidents happening per day has been fairly constant over the years. On an average, New York City has 595.8 road accidents per day. But to counter this, the FDNY average response time for vehicle accidents stands at a poor 60 minutes and 27 seconds2 which is 4 times the national average of 15 minutes and 19 seconds3.
According to Mark Cunningham in EMS. One More Time(2015)4, the reason for this poor performance is systemic inefficiency ranging from fewer working hours for EMS drivers to poor algorithm design for assigning operators.
This problem can be addressed by understanding and analyzing the trends in traffic incidences. For instance, it can be observed from the above graph that 25th December is often the least violent day on the road of the year. By exploring more such trends, the FDNY would be better equipped to improve their response time and save numerous lives.
cleannyc %>% separate(CRASH.TIME, c("hour", "min"), ":") -> cleannyc
as.data.frame(table(cleannyc$hour)) -> time_freq
time_freq[order(time_freq$Var1),] -> time_freq$Var1
time_freq$Var1$Freq <- NULL
cleannyc %>% mutate(
killed = (NUMBER.OF.PERSONS.KILLED),
injured = (NUMBER.OF.PERSONS.INJURED)) -> cleannyc
We first look at the the timings of the accidents in the day. The following graph plots the number of road crash victims against the time of the accident.
i = 0
kills=c()
injuries= c()
while (i<24){
sum(cleannyc[cleannyc$hour == i,]$killed) -> killed_sum
sum(cleannyc[cleannyc$hour == i,]$injured) -> injured_sum
kills = c(kills, killed_sum)
injuries = c(injuries, injured_sum)
i = i+1
}
as.data.frame(cbind(time_freq, kills, injuries)) -> time_score
fig3 = plot_ly(time_score, x = ~Var1, y = ~kills, type = 'scatter', mode = "lines+markers", name = "Deaths")
fig3 %>% add_trace(y = ~injuries, name = "Injuries") -> fig3
fig3 <- fig3 %>% layout(yaxis = list(title = 'Count'), barmode = 'group')
fig3 %>% layout(title = "Hourly deaths and injuries from road accidents",
xaxis = list(title = "Time of the day",
ticktext = list("00:00-01:00", "01:00-02:00", "02:00-03:00","03:00-04:00", "04:00-05:00","05:00-06:00","06:00-07:00", "07:00-08:00", "08:00-09:00", "09:00-10:00", "10:00-11:00","11:00-12:00","12:00-13:00", "13:00-14:00", "14:00-15:00", "15:00-16:00", "16:00-17:00", "17:00-18:00", "18:00-19:00", "19:00-20:00", "20:00-21:00", "21:00-22:00", "22:00-23:00", "23:00-00:00"),
tickvals = list(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)),
yaxis = list(title = "No. of Cases")) -> fig3
fig3
From the graph, we see a spike in accidents between 8 to 9 am. This is when people go to their workplaces. We can also observe that the highest amount of fatalities are found between the timings 4 pm to 6pm.
This is usually the time when the working class return from their offices. It also coincides with people going out in the evening. Thus, this increases the chances of an accident happening. Since people are usually tired after a day’s work, the chances of driving errors is higher as well.
During these times, the FDNY and emergency medical services should be the most alert. Since New York firemen work in 8-hour long shifts, it should be ensured that the shift change should not happen during these periods.
We now explore which days of the week are the most prone to accidents. The graph below plots the number of accidents against the days of the week.
day = weekdays(as.Date(cleannyc$CRASH.DATE, tryFormats = "%m/%d/%Y"))
cleannyc = cbind(cleannyc, day)
as.data.frame(table(cleannyc$day)) -> day_table
factor(day_table$Var1, levels= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) -> day_table$Var1
fig4 = plot_ly(day_table, x = ~Var1, y = ~Freq, type = "bar")
fig4 %>% layout(title = "Weekly trends in Road Accidents (2012-2019)",
yaxis = list(title ="No. of Accidents"),
xaxis = list(title = "Day of the Week")) -> fig4
fig4
Here we observe that Friday has the highest amount of road accidents happening and the weekends: Saturday and Sunday have significantly fewer accidents compared to weekdays.
Fridays tend to be crowded on the roads as most people go out during Friday evenings before the weekend.
Apart from Friday, the no. of accidents occuring is around the same during the weekdays. The weekends are safer since most workplaces are closed and thus, fewer people go out during the day.
The emergency services dept. should consider having longer shifts for emergency personnel on the days with higher traffic.
Apart from this, New York City Fire Department should consider investing into hiring more personnel and increasing emergency stations around the city to reduce the response time. A limitation of this dataset was that it does not contain independent contractual service providers which are common in New York due to its inefficient emergency services. These providers are contractually obligated to provide service in a given timeframe. Hence, they perform much better than the FDNY.
By incorporating the learnings from the above observations, the FDNY can aim to become a competent and trustable emergency service provider.
Without data wrangling we are unable to perform aggregates and data visualisation, due to the missing amounts of data and possible false data entries. This will lead us to make false conclusions and risk the integrity of data analysis.
City of New York. (2021). Vision Zero in New York City. Retrieved from https://www1.nyc.gov/content/visionzero/pages/.
NYC OpenData. (2021). Motor Vehicle Collisions - Crashes. Retrieved from https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95.
Sullivan & Galleshaw LLP. (2021, 2021). How Common are Car Accidents in NYC? Retrieved from https://www.sullivangalleshaw.com/common-car-accidents-nyc/.
DNA Info (2013) 43 People Injured in Bed-Stuy When Car Collides Head-on with City Bus https://www.dnainfo.com/new-york/20130909/bed-stuy/43-people-injured-bed-stuy-when-car-collides-head-on-with-city-bus/
Fleet Report - Mayor’s Office of Operations. (2021). Retrieved 2 July 2021, from https://www1.nyc.gov/site/operations/performance/fleet-report.page
End-to-End Response Time - 911 Reporting . (2021). Retrieved 2 July 2021, from https://www1.nyc.gov/site/911reporting/reports/end-to-end-repsonse-time.page
NHTSA(2021). Retrieved 3 July 2021, from https://www-fars.nhtsa.dot.gov/Main/index.aspx
EMS, One More Time. (2015). Retrieved 4 July 2021, from https://www.city-journal.org/html/ems-one-more-time-12793.html?wallit_nosession=1